# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Read data from file into DF ld
ld_c = pd.read_csv('/Users/SeniKamara/BESTPICK/DATA_SCIENCE/DAND/PROJECT_5/PROJECT_5.5/PROJECT_SUBMISSION/prosperLoanData.csv')
ld_c.head()
# First we make copy of the original DF
ld = ld_c.copy()
# Looking at the overall shape of the DF
ld.shape
# Looking at the structure of the dataset
ld.info()
# Need to convert DF to Excel for visual assessment
from pandas import ExcelWriter
ld_300 = ld.head(300)
writer = ExcelWriter('LoanData.xlsx')
ld_300.to_excel(writer)
writer.save()
ld_300 table¶Loan Data from Prosper: This data set contains 113,937 loans with 81 variables on each loan.
Data columns (total 81 columns)
This information was very valuable for understanding which variables are "qualitative" and which are "quantitative". That inturn dictates which plot we're going to use for their exploration.
- There're 113,937 loans with 81 variables on each loan in the dataset. There are numeric variables such as:
BorrowerAPR, MonthlyLoanPayment, EmploymentStatusDuration, DebtToIncomeRatio, LoanOriginalAmount.
- And there're also categorical variables such as:
LoanStatus: Chargedoff, Defaulted, Past Due, Current, Completed;
IncomeRange: Not employed, Not displayed, USD 1-24,999, USD 25,000-49,999, USD 50,000-74,999, USD 75,000-99,999, USD 100,000+;
How the borrower's terms affect the loan terms?
What affects the borrower’s APR or interest rate?
Is there a relationship beteween employment, income, and the loan amount?
What are characteristics of the large original loan amounts?
What are risky characteristics of "bad" (delinquent) loans?
Decided to group and specify variables that I think will help support my investigation into my features of interest:
Loan parameters
Borrower’s parameters
Income
- DebtToIncomeRatio
- IncomeRange
- IncomeVerifiable
Borrowers parameters, qualitative variables¶Occupation# Setting plot size
fig = plt.figure(figsize = [8,18])
# Setting color and order
base_color = sb.color_palette()[0]
oc_counts = ld['Occupation'].value_counts()
oc_order = oc_counts.index
# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['Occupation'].value_counts().max()
max_prop = max_count / n_ld
# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)
# Labeling the x axis
plt.xlabel('proportion');
plt.xticks(rotation = 90); #Rotating the labels 90 deg not to overlap
# Labelling the bars with ratio parameters
for i in range(oc_counts.shape[0]):
count = oc_counts[i]
pct_string = '{:0.1f}%'.format(100*count/n_ld)
plt.text(count+1, i, pct_string, va = 'center');
sb.countplot(data = ld, y = 'Occupation',
color = base_color, order = oc_order);
Occupation:¶Other makes the biggest category with 25.1%Ocupation category is Professional with 12%Computer Programmer with 3.9% Insight:
Professional - what do they do for a living?Question: What's the true meaning of the value: Professional? Why a quarter of all borrowers indicated Other for their occupation, and does it affect loan parameters?
IsBorrowerHomeowner# Checking for the different values in the IsBorrowerHomeowner column
ld.IsBorrowerHomeowner.value_counts()
# Checking for null values
ld.IsBorrowerHomeowner.isnull().sum()
IsBorrowerHomeowner:¶Homeowner and non-homeowners evenly devided between borrowers - 1000 difference.
Insight: Lack of homeonwership doesn't decrease chances of getting a loan
Question: Does homeownership affect the APR or the LA (Loan Amount)?
EmploymentStatus# Checking for the different values in the LoanStatus column
ld.EmploymentStatus.value_counts()
# Checking for null values
ld.EmploymentStatus.isnull().sum()
# Setting plot size
plt.figure(figsize = [8, 5])
# Setting color and order
base_color = sb.color_palette()[0]
es_counts = ld['EmploymentStatus'].value_counts()
es_order = es_counts.index
# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['LoanStatus'].value_counts().max()
max_prop = max_count / n_ld
# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)
# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
# Labeling the x axis
plt.xlabel('proportion');
plt.xticks(rotation = 90); #Rotating the labels 90 deg not to overlap
# Labelling the bars with ratio parameters
for i in range(es_counts.shape[0]):
count = es_counts[i]
pct_string = '{:0.1f}%'.format(100*count/n_ld)
plt.text(count+1, i, pct_string, va = 'center');
sb.countplot(data = ld, y = 'EmploymentStatus',
color = base_color, order = es_order);
EmploymentStatus:¶EmployedFull-timeSelf-employedNot-Available2255 entries with Null values
Insight: Around 4.7% of borrowers don't have a regular income.
Question: Is there a difference in loan terms for these 4.7%? And what is the difference between "Employed" and "Full-time"?
LoanStatus# Checking for the different values in the LoanStatus column
ld.LoanStatus.value_counts()
ld.LoanStatus.isnull().sum()
# Setting plot size
plt.figure(figsize = [8, 5])
# Setting color and order
base_color = sb.color_palette()[0]
ls_counts = ld['LoanStatus'].value_counts()
ls_order = ls_counts.index
# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['LoanStatus'].value_counts().max()
max_prop = max_count / n_ld
# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)
# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
# Labeling the x axis
plt.xlabel('proportion');
plt.xticks(rotation = 90); #Rotating the labels 90 deg not to overlap
# Labelling the bars with ratio parameters
for i in range(ls_counts.shape[0]):
count = ls_counts[i]
pct_string = '{:0.1f}%'.format(100*count/n_ld)
plt.text(count+1, i, pct_string, va = 'center');
sb.countplot(data = ld, y = 'LoanStatus',
color = base_color, order = ls_order);
# Combining categories of 'Past Due' values; also 'FinalPaymentInProgress' with 'Completed'
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (1-15 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (31-60 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (61-90 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (91-120 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (16-30 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (>120 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('FinalPaymentInProgress', 'Completed')
# Dropping the outliers 'Cancelled'
#df = df[df.line_race != 0]
ld = ld[ld.LoanStatus != 'Cancelled']
# Checking the code
ld.LoanStatus.value_counts()
# Setting plot size
plt.figure(figsize = [8, 5])
# Setting color and order
base_color = sb.color_palette()[0]
ls_counts = ld['LoanStatus'].value_counts()
ls_order = ls_counts.index
# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['LoanStatus'].value_counts().max()
max_prop = max_count / n_ld
# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)
# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
# Labelling the bars with ratio parameters
for i in range(ls_counts.shape[0]):
count = ls_counts[i]
pct_string = '{:0.1f}%'.format(100*count/n_ld)
plt.text(count+1, i, pct_string, va = 'center');
sb.countplot(data = ld, y = 'LoanStatus',
color = base_color, order = ls_order);
LoanStatus :¶4.5% are deliquent
Insight: 16.7% of all loans are bad business
Wrangling: Combined the "Past Due" loans. Combined 'FinalPaymentInProgress' with 'Completed' loans. Dropped 'Cancelled' loans. Organised into ordered categorical type
Question: Are there common characteristics among borrowers or their loan terms, that default? Are there similarities among borrowers or their loan terms, that repaid?
# Assess the values of 'IncomeVerifiable' column
ld.IncomeVerifiable.value_counts()
# Checking for Null entries
ld.IncomeVerifiable.isnull().sum()
IncomeVerifiable :¶About 8% of all borrowers can't verify their income
Insight: 8% of borrowers get loans withouth proving their income
Question: Are there common characteristics among borrowers or their loan terms, that can't verify income?
ListingCategory (numeric). Going to rename the variable to: LoanPurpose# Renaming the column
# df.rename(columns={"A": "a", "B": "c"})
ld.rename(columns={'ListingCategory (numeric)': 'LoanPurpose'}, inplace = True)
# Testing code
ld.LoanPurpose[3:5]
ld.LoanPurpose.value_counts()
# Setting the size of the plot
fig = plt.figure(figsize = [8,18])
# Setting color and order
base_color = sb.color_palette()[0]
lp_counts = ld['LoanPurpose'].value_counts()
lp_order = lp_counts.index
# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['LoanPurpose'].value_counts().max()
max_prop = max_count / n_ld
# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)
# Labelling the bars with ratio parameters
for i in range(lp_counts.shape[0]):
count = lp_counts[i]
pct_string = '{:0.1f}%'.format(100*count/n_ld)
plt.text(count+1, i, pct_string, va = 'center');
sb.countplot(data = ld, y = 'LoanPurpose',
color = base_color);
LoanPurpose :¶Purposes: 2 and 3 are in fourth place with 6.5% and 6.3% respectively
Insight: Very big gap between pupose 1 and the other purposes.
Question: What is purpose 1 and why there is such a big gap?
IncomeRange# Exploring values of the 'IncomeRange' variable
ld.IncomeRange.value_counts()
# Combining "$0" and "Not empoloyed" values
ld['IncomeRange'] = ld['IncomeRange'].replace('$0', 'Not employed')
ld.IncomeRange.value_counts()
ld.LoanStatus.value_counts()
# convert LoanStatus and IncomeRange into ordered categorical types
ordinal_var_dict = {'LoanStatus': ['Chargedoff', 'Defaulted', 'Past Due', 'Current', 'Completed'],
'IncomeRange': ['Not employed', 'Not displayed', '$1-24,999', '$25,000-49,999',
'$50,000-74,999', '$75,000-99,999', '$100,000+']}
for var in ordinal_var_dict:
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = ordinal_var_dict[var])
ld[var] = ld[var].astype(ordered_var)
plt.figure(figsize = [8,5])
#ax = fig.add_axes([1, 5, 2, 10])
# Setting color and order
base_color = sb.color_palette()[0]
ir_counts = ld['IncomeRange'].value_counts()
#ir_order = ir_counts.index
# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['IncomeRange'].value_counts().max()
max_prop = max_count / n_ld
# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)
# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
# Labelling the bars with ratio parameters
for i in range(ir_counts.shape[0]):
count = ir_counts[i]
pct_string = '{:0.1f}%'.format(100*count/n_ld)
plt.text(count+1, i, pct_string, va = 'center');
sb.countplot(data = ld, y = 'IncomeRange',
color = base_color);
IncomeRange :¶Insight: Income ranges: USD 100,000+ and USD 75,000-99,999 together make up 30% of all loans.
Wrangling: Convert the income ranges into categorical type. Combined '0' and 'Not employed'
Question: Any differences in the loan parameters between the lower ranges and the higher ranges? Any similarities in the loan parameters within the ranges?
Term# Exploring the values of the 'Term' variable
ld.Term.value_counts()
ld.Term.isnull().sum()
labels = '60 months', '36 months', '12 months'
sizes = [24545, 87778, 1614]
colors = ['gold', 'yellowgreen', 'lightcoral']
plt.title('Loan Term (months)')
# Plot
plt.pie(sizes, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
Term :¶Only 1.4% of all loans are taken/given for one year
Insight: Lender specializes on commercial loans with 1yr., 3yr. and 5yr terms.
Question: Any differences between 3yr. and 5yr loans? Any similarities within 3yr and 5yr loans?
LoanOriginalAmountld.LoanOriginalAmount.isnull().sum()
# Plotting LoanOriginalAmount
bins = np.arange(0, ld['LoanOriginalAmount'].max()+1000, 1500)
plt.title('Original Loan Amount')
plt.xlabel('Loan Amount in USD')
plt.ylabel('Count')
plt.hist(data = ld, x = "LoanOriginalAmount", bins = bins);
# df.columns.get_loc("pear")
ld.columns.get_loc('LoanOriginalAmount')
# Alternative slicing to avoid error, none of which work
# x[(slice(None),1)]
# data.ix[:3, :'pop']
# data.loc[:'Illinois', :'pop']
sb.distplot(ld['LoanOriginalAmount']);
# Organizing the bins for better visualisation
bin_edges = np.arange(2500, 35000+5000, 5000)
plt.hist(ld['LoanOriginalAmount'] , bins = bin_edges, rwidth = 0.7)
plt.xticks(np.arange(5000, 35000+2500, 5000))
plt.xlabel('Loan Amount ($)')
plt.ylabel('Number of Loans');
LoanOriginalAmount :¶Graph also shows that most of the loans are taken/given in USD 5000 increments: USD 5K, USD 10000, USD 15000, USD 20000, USD 25000, USD 30000 and USD 35000
Insight: Lender specializes on commercial loans with, loan values (LV): USD 5K, USD 10K, USD 20K, USD 25K, USD 30K, and USD 35K
Question: Is there a difference or similarities in borrowers and loan terms according to LV?
BorrowerAPRld.BorrowerAPR.isnull().sum()
# Dropping the null values
# df.dropna(subset=['name', 'born'])
ld.dropna(subset=['BorrowerAPR'], inplace=True)
ld.BorrowerAPR.isnull().sum()
# Plotting histogram with variable 'BorrowerAPR'
# Organizing the bins for better visualisation
bin_edges = np.arange(0.05, 0.4+0.1, 0.1)
plt.hist(ld['BorrowerAPR'] , bins = bin_edges, rwidth = 0.7)
plt.xticks(np.arange(0.1, 0.4+0.05, 0.1))
plt.xlabel('Borrower APR')
plt.ylabel('Number of Loans');
BorrowerAPR :¶The least quantity of loans in the portfolio have 0.4% APR.
Insight: Quantity of loans with 0.4% is considerably small compared to the other rates.
Wrangling: Dropped null values.
Question: Why is there such a big gap between the loans with 0.4% APR and the other rates, considering that 0.1% difference between rates is not that significant?
MonthlyLoanPaymentld.MonthlyLoanPayment.isnull().sum()
# Plotting histogram with variable 'MonthlyLoanPayment'
plt.hist(ld['MonthlyLoanPayment']);
plt.xlabel = 'Monthly payments (in USD)'
plt.ylabel = 'Number of Loans'
# Organizing the bins for better visualisation
plt.figure(figsize = [10, 5])
bin_edges = np.arange(50, 1250+100, 100)
plt.xticks(np.arange(100, 1250+100, 100))
plt.xlabel = 'Monthly payments (in USD)'
plt.ylabel = 'Number of Loans'
plt.hist(ld['MonthlyLoanPayment'] , bins = bin_edges, rwidth = 0.7);
# Converting the xscale and ticks for better readability
bin_edges = 10 ** np.arange(0.8, np.log10(ld.MonthlyLoanPayment.max())+0.1, 0.05)
plt.hist(ld.MonthlyLoanPayment, bins = bin_edges)
plt.xlabel = 'Monthly payments (in USD)'
plt.ylabel = 'Number of Loans'
plt.xscale('log')
tick_locs = [10, 30, 100, 300, 500, 1000, 3000]
plt.xticks(tick_locs, tick_locs);
MonthlyLoanPayment :¶The plot shows bimodal distribution with picks at USD 160 and USD 400 monthly payments.
Insight: A very large amount of loans is generating over USD 400 monthly payments.
Exploration: looked at the data using a log transform. Under the transformation, the data looked bimodal, with one peak at USD 160 and, and another at USD 400.
Question: Do smaller payments and therefore smaller loans have less delinquent Loan Statuses?
EmploymentStatusDurationld.EmploymentStatusDuration.isnull().sum()
ld_em =ld.dropna(subset=['EmploymentStatusDuration'])
ld_em.EmploymentStatusDuration.isnull().sum()
# Plotting histogram with variable 'EmploymentStatusDuration'
plt.hist(data = ld, x = 'EmploymentStatusDuration');
The plot is scewed to the right with a long tail, so we're going to use log scale to make sense of the outliers.
np.log10(ld['EmploymentStatusDuration'].describe())
# Plotting the variable on the log scale
plt.figure(figsize = [8, 5])
bin_edges = 10 ** np.arange(-1, 5+0.1, 0.1)
plt.hist(data = ld, x = 'EmploymentStatusDuration' , bins = bin_edges)
plt.xscale('log');
Obvious outliers on the left side of the distribution
# Converting the xscale and ticks for better readability
plt.figure(figsize = [8, 5])
bin_edges = 10 ** np.arange(0.8, np.log10(ld.EmploymentStatusDuration.max())+0.1, 0.1)
plt.hist(ld.EmploymentStatusDuration, bins = bin_edges)
plt.xscale('log')
#It is important that the xticks are specified after xscale since that
#function has its own built-in tick settings.
tick_locs = [10, 30, 100, 300, 1000, 3000]
plt.xticks(tick_locs, tick_locs);
# Select low outliers
low_outliers = (ld_em['EmploymentStatusDuration'] < 12)
print(low_outliers.sum())
# Select high outliers
high_outliers = (ld_em['EmploymentStatusDuration'] > 300)
print(high_outliers.sum())
EmploymentStatusDuration :¶The plot shows that most borrowers have: 50 to 200 months or 4 to 16 years of employment history.
Insight: Most of the borrowers have a solid employment history. 13721 loans are given to borrowers with less than 12 months of employment.
Question: What are the loan terms for the outliers, especially for those with less than 12 months of employment history?
DebtToIncomeRatio# Checking for nulls in DI
ld.DebtToIncomeRatio.isnull().sum()
# Checking for nulls in ld_em (dropped EMP nulls)
ld_em.DebtToIncomeRatio.isnull().sum()
# Dropping the nulls in DI
ld_em_di = ld.dropna(subset=['DebtToIncomeRatio'])
# Checking code
ld_em_di.DebtToIncomeRatio.isnull().sum()
ld_em_di['DebtToIncomeRatio'].describe()
Very unusual value: max = 10; How can a ratio of Debt to Income be 10, that means the debt burden is 10 times higher than income. How is that possible?
# Plotting histogram with variable 'DebtToIncomeRatio'
plt.hist(data = ld_em_di, x = 'DebtToIncomeRatio')
plt.xlim(0, 3);
# Organizing the bins for better visualisation
plt.figure(figsize = [8, 6])
bin_edges = np.arange(0.05, 1+0.15, 0.1)
plt.hist(ld_em_di['DebtToIncomeRatio'] , bins = bin_edges, rwidth = 0.6)
plt.xticks(np.arange(0.1, 1+0.15, 0.1));
# Converting the xscale and ticks for better readability
bin_edges = 10 ** np.arange(0.01, np.log10(ld.DebtToIncomeRatio.max())+0.1, 0.1)
plt.hist(ld_em_di.DebtToIncomeRatio, bins = bin_edges)
plt.xscale('log')
#It is important that the xticks are specified after xscale since that
#function has its own built-in tick settings.
tick_locs = [0.5, 1, 2, 5, 10, 15, 20]
plt.xticks(tick_locs, tick_locs);
# Select risky loans
risky_ratio = (ld_em_di['DebtToIncomeRatio'] > 0.6)
print(risky_ratio.sum())
# Select more_risky loans - more than 80% of income
more_risky = (ld_em_di['DebtToIncomeRatio'] > 0.8)
print(more_risky.sum())
# Select very_risky loans - more than 100% of income
very_risky = (ld_em_di['DebtToIncomeRatio'] > 1)
print(very_risky.sum())
Loans with "DebtToIncomeRatio" more than 60% are riskier than the ones with a smaller ratio.
Insigt: It will be good to check the "LoanStatus" and other characteristics of these riskier loans
This is an area of special attention. Debt Income ratio more than "1" means that the borrower has more debts than income, and that is problematic situation!
DebtToIncomeRatio :¶The plot shows that most loans have Debt to Income ratio between 0.1 and 0.2.
Insight: Very unusual outliers showing a large number of loans with a very high Income to Debt ratio 1205 loans > 0.8 DI ratio, but 799 loans > 1 DI ratio
Wrangling: Dropped 8424 null values.
Exploring: Did log scale transformation to identify outliers
Question: What are the loan terms for outliers with 0.8 Debt to Income ratio?
StatedMonthlyIncome# Exploring values of 'StatedMonthlyIncome'
ld.StatedMonthlyIncome
ld.StatedMonthlyIncome.describe()
# Converting floats to int
# df.col = df.col.astype(int)
ld.StatedMonthlyIncome = ld.StatedMonthlyIncome.astype(int)
# Testing
ld.StatedMonthlyIncome.dtype
# plotting carat on a standard scale
binsize = 1000
bins = np.arange(0, ld['StatedMonthlyIncome'].max()+binsize, binsize)
plt.figure(figsize=[8, 5])
plt.hist(data = ld, x = 'StatedMonthlyIncome', bins = bins)
plt.xlim([0,40000])
plt.show()
Right skewed plot with a long tail.
# Converting the xscale and ticks for better readability
plt.figure(figsize=[10, 7])
bin_edges = 10 ** np.arange(0.8, np.log10(ld.StatedMonthlyIncome.max())+0.1, 0.1)
plt.hist(ld.StatedMonthlyIncome, bins = bin_edges)
plt.xscale('log')
#It is important that the xticks are specified after xscale since that
#function has its own built-in tick settings.
tick_locs = [100, 500, 2000, 5000, 20000]
plt.xticks(tick_locs, tick_locs);
StatedMonthlyIncome :¶The plot shows normal distribution with the mean income at 5000
Insight: USD 5000 monthly income for the majority of borrowers corresponds with the findings in the IncomeRange column where the mean annual income falls in the range of USD 25000 - USD 75000.
Term :¶Only 1.4% of all loans are taken/given for one year
Insight: Lender specializes on commercial loans with 1yr., 3yr. and 5yr terms.
Question: Any differences between 3yr. and 5yr loans? Any similarities within 3yr and 5yr loans?
Term :¶Only 1.4% of all loans are taken/given for one year
Insight: Lender specializes on commercial loans with 1yr., 3yr. and 5yr terms.
Question: Any differences between 3yr. and 5yr loans? Any similarities within 3yr and 5yr loans?
IsBorrowerHomeowner:¶Homeowner and non-homeowners evenly devided between borrowers - 1000 difference.
Insight: Lack of homeonwership doesn't decrease chances of getting a loan
Question: Does homeownership affect the APR or the LA (Loan Amount)?
EmploymentStatus:¶EmployedFull-timeSelf-employedNot-AvailableInsight: 4.7% of borrowers don't have a regular income.
Question: Is there a difference in loan terms for these 4.7%? And what is the difference between "Employed" and "Full-time"?
IncomeVerifiable :¶About 8% of all borrowers can't verify their income
Insight: 8% of borrowers get loans withouth proving their income
Question: Are there common characteristics among borrowers or their loan terms, that can't verify income?
LoanStatus :¶4.5% are deliquent
Insight: 16.7% of all loans are bad business
Wrangling: Combined the "Past Due" loans. Combined 'FinalPaymentInProgress' with 'Completed' loans. Dropped 'Cancelled' loans. Organised into ordered categorical type
Question: Are there common characteristics among borrowers or their loan terms, that default? Are there similarities among borrowers or their loan terms, that repaid?
IncomeRange :¶Insight: Income ranges: USD 100,000+ and USD 75,000-99,999 together make up 30% of all loans.
Wrangling: Convert the income ranges into categorical type. Combined '0' and 'Not employed'
Question: Any differences in the loan parameters between the lower ranges and the higher ranges? Any similarities in the loan parameters within the ranges?
LoanOriginalAmount :¶Graph also shows that most of the loans are taken/given in USD 5000 increments: USD 5K, USD 10000, USD 15000, USD 20000, USD 25000, USD 30000 and USD 35000
Insight: Lender specializes on commercial loans with, loan amounts (LA): USD 5K, USD 10K, USD 20K, USD 25K, USD 30K, and USD 35K
Question: What variables affect the LA?
BorrowerAPR :¶The least quantity of loans in the portfolio have 0.4% APR.
Insight: Quantity of loans with 0.4% is considerably small compared to the other rates.
Wrangling: Dropped null values.
Question: Why is there such a big gap between the loans with 0.4% APR and the other rates, considering that 0.1% difference between rates is not that significant?
EmploymentStatusDuration :¶The plot shows that most borrowers have: 50 to 200 months or 4 to 16 years of employment history.
Insight: Most of the borrowers have a solid employment history. 13721 loans are given to borrowers with less than 12 months employment history.
Question: What are the loan terms for the outliers, especially for those with less than 12 months or no employment history?
MonthlyLoanPayment:¶The plot shows bimodal distribution with picks at USD 160 and USD 400 monthly payments.
Insight: A very large amount of loans is generating over USD 400 mothly payments.
Question: Do smaller payments and therefore smaller loans have less delinquent Loan Statuses?
DebtToIncomeRatio :¶The plot shows that most loans have Debt to Income ratio between 0.1 and 0.2.
Insight: Very unusual outlier showing a large number of loans with a very high Income to Debt ratio - close to 1
Wrangling: Dropped 8424 null values.
Exploring: Did log scale transformation to identify outliers. 1205 loans have 0.8 ratio or higher
Question: What are the loan terms for outliers with 0.8 Debt to Income ratio?
MonthlyLoanPaymentvariable: When used a regular plot, there was a long tail with outliers. For further exploration log scale was applied.
EmploymentStatusDurationvariable: Regular plot revealed right scew with outliers. Log scale was applied to examine the outliers closer.
DebtToIncomeRatiovariable: Dropped 8424 null values. Regular plot made no sense. Had to applied log scale to examine the distribution.
LoanStatusvariable: removed the "bucket" for all "Past Due" loans. This way we have 3 delinquent values: Chargedoff, Defaulted, Past Due. Combined 'FinalPaymentInProgress' with 'Completed' loans. Dropped 'Cancelled' loans. Converted into ordered categorical type.
IncomeRangevariable: Converted the income ranges into categorical type. Combined '0' and 'Not employed' values.
BorrowerAPRvariable: Dropped null values.
MonthlyLoanPaymentvariable: Looked at the data using a log transform. Under the transformation, the data looked bimodal, with one peak at USD 160 and, and another at USD 400.
EmploymentStatusDurationvariable: Log scale revealed outliers to the left of the mean and programatic analysis proved that there 13721 loans with less than 12 months employment history.
DebtToIncomeRatiovariable: Log scale transformation clearly showed outliers at the upper end of the ratios. 1205 loans have 0.8 ratio or higher.
In this section we will investigate relationships between pairs of variables in our data that were introduced in the previous section (univariate exploration).
Categorical:
Numeric:
# Select high outliers
outliers = (ld_em_di['DebtToIncomeRatio'] >= 0.6)
print(outliers.sum())
# Dropping loans where DIR > 1
ld_di = ld_em_di[ld_em_di['DebtToIncomeRatio'] < 1]
# Checking that the outliers where dropped
ld_di['DebtToIncomeRatio']
# Assign ld_di DF to a new "Univariate Explored" DF - ld_u for further explorations
ld_u = ld_di
# Setting up the numeric variables
numeric_vars = ['LoanOriginalAmount', 'BorrowerAPR', 'EmploymentStatusDuration', 'MonthlyLoanPayment', 'DebtToIncomeRatio']
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(ld_em_di[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.show()
Negative correlation between APR and MP, because of negative correlation b/w APR and LA. Small LA has a bigger APR (b/c risk), and because of the LA size, the MP is small. The opposite is true!
The negative relationship b/w APR and LA (-0.3) doesn't translate into a negative relationship between EMP and APR (0.004), logically higher EMP should yeild lower APR.
Very strange that, there is no strong positive correlation b/w EMP and LA (0.09), Even though higher EMP should yeild bigger LA, unless when higher EMP doesn't mean higher IR (Income Range), and that's a very intersting insight.
Negative correlation b/w APR and MP (-0.23). Lower APR means bigger LA, bigger LA means bigger MP - that's why there's a negative relationship b/w APR and MP.
What influences APR?
# plot matrix: sample 500 loans so that plots are clearer and they render faster
samples = np.random.choice(ld_u.shape[0], 500, replace = False)
loans_samp = ld_u.loc[samples,:]
g = sb.PairGrid(data = loans_samp, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20)
g.map_offdiag(plt.scatter);
We see an obvious correlation b/w LA and MP, the bigger the loan, the bigger the monthly payment. Other variables don't show noticeable correlation between variables.
We can also see that there is no strength in the relatioships we raised earlier. Meaning that the levels of negative and positive correlation b/w the variables displayed in the previous heatmap are not evident.
# plot scatter plot: sample 500 loans so that plots are clearer and they render faster
samples = np.random.choice(ld_u.shape[0], 1000, replace = False)
loans_samp = ld_u.loc[samples,:]
# Scatter plot for EMP and LA
# plt.scatter(data = df, x = 'num_var1', y = 'num_var2')
plt.scatter(data = loans_samp, x = 'EmploymentStatusDuration', y = 'BorrowerAPR');
# plt.xlabel('EmploymentStatusDuration')
# plt.ylabel('BorrowerAPR');
No relationship can be seen, this means that the correlation from the heatmap is insignificant
# Using seaborn reg.plot with regression line
# sb.reg plot(data = df, x = 'num_var1', y = 'num_var2')
sb.regplot(data = loans_samp, x = 'EmploymentStatusDuration', y = 'BorrowerAPR');
No obvious relationship can be seen
# Checking supposed negative relationship b/w APR and LA (-0.3)
# Sample 500 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0], 500, replace = False)
loans_samp = ld_u.loc[samples,:]
sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR');
Here the negative correlation b/w the APR and the LA is presented by the regression line. The slope is not that steep and the points around it don't align to strengthen the support for this relationship.
However, another insight is coming to light. There seems to be a fixed relationship, where a paticular LA (10K, 15K, 20K...) is given out at different APRs.
Let's increase the sample.
# Checking supposed negative relationship b/w APR and LA (-0.3)
# Sample 500 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0], 2000, replace = False)
loans_samp = ld_u.loc[samples,:]
sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR');
Yes, we clearly see that LA = USD 25K has been approved at APR range of 0.07 to 0.34, and even though it is a pretty wide range, (considering the the full APR range examined is: 0.01 to 0.44) the ranges of APR get smaller as the LA increses and that is what is supporting the slight negative correlation b/w APR and LA.
# Checking supposed negative relationship b/w DIR and APR
# Sample 500 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0], 500, replace = False)
loans_samp = ld_u.loc[samples,:]
plt.xticks([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.yticks([0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
sb.regplot(data = loans_samp, x = 'DebtToIncomeRatio', y = 'BorrowerAPR');
Shows a slight positive relationship, meaning higher DIR leads to higher APR. That can be understood in terms of risk.
# Sample 5000 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0],5000, replace = False)
loans_samp = ld_u.loc[samples,:]
# Plot DIR vs LA
sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'DebtToIncomeRatio');
Not much correlation, but there is a pattern which shows that most LAs are issued with DIR < 0.60. This can be understood because of risk. We can see a pattern that 35K LA stops at 0.4 DIR; 25K LA level has higer DIR on record (0.6); 15K LA registered (0.9) and so on. So we see a relationship: higher DIRs - lower LAs. This is logical: lendging large amounts to borrowers with high debt burdern is very risky and irresponsible.
Loans with "DebtToIncomeRatio" more than 60% are riskier than the ones with a smaller ratio.
Insigt: It will be good to check the "LoanStatus" and other characteristics of these riskier loans
This is an area of special attention. Debt Income ratio more than "1" means that the borrower has more debts than income, and that is problematic situation!
There is a slight positive correlation b/w DI and APR, not sure how significant is this level of correlation for our purposes.
# Sample 2000 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0], 2000, replace = False)
loans_samp = ld_u.loc[samples,:]
sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'EmploymentStatusDuration');
No evident correlation between EMP and LA
ld_u.EmploymentStatusDuration.median()
ld_u.EmploymentStatusDuration.describe()
ld_u.LoanOriginalAmount.median()
ld_u.LoanOriginalAmount.describe()
Not much correlation b/w variables, but again we see data point fall in stairghl line according to the distinct LA values (10K, 15K, etc...) This may mean that the loans are sold and packaged at these LAs.
We have an interesting insight: Even though most of the loans are in the range 0 to 140 EMP and up to 12K, there are outliers with 35K loans at 0 to 100 EMP levels. What characteristic of a borrower made the lender ingnore little or no employment history to give maximum LA to minimum EMP?
It will be interesting to investigate how the parameters of:
Numeric:
are afected by the categories of:
Categorical:
How:
Numeric:
differ according to the LS (Loan Status):
Categorical:
# Checking for categories
ld_u.EmploymentStatus.value_counts()
# Setting up 'EmploymentStatus' categorical variables
empstatus_vars = ['Employed', 'Full-time', 'Not available',
'Other', 'Self-employed', 'Part-time',
'Retired', 'Not employed']
estatuses = pd.api.types.CategoricalDtype(ordered = True,
categories = empstatus_vars)
ld_u['EmploymentStatus'] = ld_u['EmploymentStatus'].astype(estatuses);
base_color = sb.color_palette()[0]
sb.violinplot(data = ld_u, x = 'EmploymentStatus',
y = 'LoanOriginalAmount', color = base_color)
plt.xticks(rotation = 20);
In it is evedent from the plot that status "Employed" get more of larger LAs compared to the other statuses.
Status "Other" displays more 10K and 15K LAs. The long tail suggest that there're outliers all the way to 35K LA. Insight: very interesting to know, what is meant by "Other"? What additional documents the bank is asking to qualify borrowers with this status for a loan.
Status "Full-time" is in the third place with fewer LAs at 10K and 15K LAs, compare to "Other", but has more loans issued with 25K LA. It is not clear what statuses "Employed", "Full-time" and "Other" really mean to the bank. Why he issues more 10K and 15K LAs to "Other" than "Full-time", but more 25K LAs to "Employed" than "Other"
However it is evident that the group with less or no employment verification yeilds smaller LAs. "Not-available", "Part-time", "Retired" and "Not employed" have most LAs at 2.5K level, with tails that stop at 15K and 25K LAs. This is considerable lower than the: "Employed", "Full-time" and "Other" group with tails reaching 35K LAs
base_color = sb.color_palette()[0]
sb.violinplot(data = ld_u, x = 'EmploymentStatus',
y = 'BorrowerAPR', color = base_color)
plt.xticks(rotation = 20);
For the majority loans in the "low employment" categories ("Part-time", "Retired" and "Not employed" except for "Not-available"), the APR is lower. However, the minimum APR for this categories is higher than "Not-availabe". "Not-availabe" is capped 0.32 APR while the rest of the "low employment" cats rise beyond 0.4 APR surpassing the "high employment" cats. ("Employed", "Full-time", "Other").
"Not available" has a significant amount of loans with a high APR = 0.3%, but no outliers at the higher end, signalling that its a risk group. Similar situation with "Other" with many loans at APR = 0.35% and a very short bottom tail, meaning no low APRs for this category - risk group. However, it is not clear why this logic doesn't spread to the "low employment" cats.
Insight: Why do we have loans with APR = 0 or close to 0?
# Identify the minimum value
ld_u['BorrowerAPR'].min()
# Select low outliers and count them
out_apr = (ld_u['BorrowerAPR'] <= 0.1)
print(out_apr.sum())
Why is the bank giving out loans at almost 0% interest?
# Setting up the categoric variables
cat_vars = ['IncomeVerifiable', 'IncomeRange']
# Renaming the categories for visualisation
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$1-24,999', '$1-24.9K')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('Not displayed', 'ND')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$75,000-99,999', '$75K-99.9K')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$100,000+', '$100K+')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$50,000-74,999', '$50K-74.9K')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$25,000-49,999', '$25K-49.9K')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('Not employed', 'NE')
# Setting up 'IncomeRange' categorical variables
incrange_vars = ['NE', 'ND', '$1-24.9K', '$25K-49.9K',
'$50K-74.9K', '$75K-99.9K', '$100K+']
incranges = pd.api.types.CategoricalDtype(ordered = True,
categories = incrange_vars)
ld_u['IncomeRange'] = ld_u['IncomeRange'].astype(incranges);
ld_u.IncomeRange.value_counts()
# plot matrix of numeric features against categorical features.
# can use a larger sample since there are fewer plots and they're simpler in nature.
samples = np.random.choice(ld_u.shape[0], 20000, replace = False)
ldu_samp = ld_u.loc[samples,:]
def boxgrid(x, y, **kwargs):
""" Quick hack for creating box plots with seaborn's PairGrid. """
default_color = sb.color_palette()[0]
sb.boxplot(x, y, color = default_color)
plt.figure(figsize = [8, 8])
g = sb.PairGrid(data = ldu_samp, y_vars = ['LoanOriginalAmount', 'BorrowerAPR'], x_vars = cat_vars,
size = 4, aspect = 1.5)
g.map(boxgrid)
plt.show();
ld.IncomeRange.value_counts()
IV and LA - from the plot it is obvious that verifiable provides for bigger LA, especially if we consider the level of outliers. This is logical, with proof of income, the lender is willing to lend more money.
IV and APR - the plot shows that with verifiable income the APR range is larger, as well as the IQR. This is very interesting, because higher risk (no verifiable income) should be reflected in higher APR. But very high APR on the risky group (no verifiable income) may raise the risk of default, so that maybe the reason for a smaller range in this group. Lender on one hand doesn't give low APR to this group on the other hand doesn't lend with high APRs either. According to the plot, "no income verification" status, raises the minimum APR and lowers the maximum APR for the borrower.
IR and LA - Interesting insight, ND (Not displayed) group has very large outliers (25K), reaching the levels for which qualifies 99K IR. What is the true meaning of ND status?
Some outliers in the "1-25K" group are also large (25K). But the main mass - in the IQR is up to 6K. "1-25K" IR has overall slightly lower LAs compared to "ND" IR. The "NE" (Not Employed) group has the lowest LAs, and that is understandable. The rest of the chart is self explanatory: The higher the IR, the higher the LA.
IR and APR - Interesting insight, minimum levels of APR that are available for "ND" and "25K-49.9K" IR are not availabe for the hiher IRs.
Insight: How does the lender justify higher APRs to the group with "verifiable income" (IV) and high IR?
How:
Numeric:
differ according to the LS (Loan Status):
Categorical:
base_color = sb.color_palette()[0]
sb.violinplot(data = ld_u, x = 'LoanStatus',
y = 'LoanOriginalAmount', color = base_color)
plt.xticks(rotation = 20);
Insight: "Defaulted" and "Chargedoff" LSs are almost identical in shape and parameters compared to the other LSs! This leads to assumption that these groups of borrowers possess similar characteristics.
base_color = sb.color_palette()[0]
sb.violinplot(data = ld_u, x = 'LoanStatus',
y = 'BorrowerAPR', color = base_color)
plt.xticks(rotation = 20);
Insight: "Current" and "Past Due" LSs don't have very high APR compared to "Defaulted" and "Chargedoff".
plt.figure(figsize = [12, 5])
base_color = sb.color_palette()[0]
# left plot: violin plot
plt.subplot(1, 2, 1)
ax1 = sb.violinplot(data = ld_u, x = 'LoanStatus', y = 'EmploymentStatusDuration', color = base_color)
# right plot: box plot
plt.subplot(1, 2, 2)
sb.violinplot(data = ld_u, x = 'LoanStatus', y = 'DebtToIncomeRatio', color = base_color);
#plt.ylim(ax1.get_ylim()) # set y-axis limits to be same as left plot
ld_u['DebtToIncomeRatio'].max()
Insight: it is very surprising that all categories of LS have loans that have DIs at maximum level of 0.99
ld_u.EmploymentStatus.value_counts()
ld_esx = ld_u.loc[(ld_u['EmploymentStatus']=='Employed') |
(ld_u['EmploymentStatus']=='Full-time') |
(ld_u['EmploymentStatus']=='Not available')]
ld_esx.head()
ld_esx.EmploymentStatus.value_counts()
#sb.countplot(data = ld_empx, x = 'EmploymentStatus', hue = 'IncomeRange');
ax = sb.countplot(data = ld_esx, x = 'EmploymentStatus', hue = 'IncomeRange')
ax.legend(loc = 7, ncol = 1, framealpha = 1, title = 'IncomeRange');
plt.xticks(rotation = 30);
The plot clearly shows that "Employed" status has more borrowers than "Full-time" with the majority in 50K-74.9K IR and 25K-49.9K IR in the second place. For "Full-time" borrowers the majority is in 25K-49.9K IR and 50K-74.9K IR in the second place.
# Analizing the other ESs
ld_esxy = ld_u.loc[(ld_u['EmploymentStatus']=='Not employed') |
(ld_u['EmploymentStatus']=='Retired') |
(ld_u['EmploymentStatus']=='Part-time') |
(ld_u['EmploymentStatus']=='Self-employed') |
(ld_u['EmploymentStatus']=='Other')]
ld_esxy.head()
#sb.countplot(data = ld_empx, x = 'EmploymentStatus', hue = 'IncomeRange');
ax = sb.countplot(data = ld_esxy, x = 'EmploymentStatus', hue = 'IncomeRange')
ax.legend(loc = 6, ncol = 1, framealpha = 1, title = 'IncomeRange');
plt.xticks(rotation = 30);
Status "Other" is showing the largest number of borrowers compared to the other ESs, with its primary IR of 25K-49.9K, which is almost twice bigger than the next 2 IRs, 1K-24.9K and 50K-74.9K. 25K-49.9K is also leading in the "Self-employed" ES, and 1K-24.9K is the main IR in the "Part-time" ES.
ld_u.IncomeVerifiable.value_counts()
Interesting group "False" to explore
# Exploring with faceting
bins = np.arange(0, 0.4+0.1, 0.1 )
g = sb.FacetGrid(data = ld_u, col = 'IncomeRange', col_wrap = 4, sharey = False)
g.map(plt.hist, 'BorrowerAPR')
# Exploring with faceting
g = sb.FacetGrid(data = ld_u, col_wrap = 4, col = 'IncomeRange')
g.map(plt.hist, 'LoanOriginalAmount');
Surprisingly
EMP(Length of employment of a borrower) has no effect on the APR or theLAvalue of the loan.Exploration showed that:
ES(Employment Status) of a borrower affects the Loan Amount (LA). HigherESqualifies for higherLAs. The opposite is true.Anylysis revealed that
EShas a differentiated relationship withAPR.For the majority loans in the "low employment" categories: "Part-time", "Retired" and "Not employed" (except for "Not-available"), the
APRis lower than the "higher employment" group: "Employed", "Full-time", "Other".However, the minimum APR for the "lower" categories is higher than "Not-availabe". "Not-availabe" is capped at 0.32
APR, while the rest of the "low employment" cats rise beyond 0.4APR, surpassing the "high employment" cats ("Employed", "Full-time", "Other")."Not available" has a significant amount of loans with a high
APR= 0.3%, but no outliers at the higher end, signalling that its a risk group. Similar situation with "Other" with many loans atAPR= 0.35% and a very short bottom tail, meaning no lowAPRsfor this category - risk group. However, it is not clear why this logic doesn't spread to the "low employment" cats.Exploration showed that borrowers with higher
DIR(Debt to Income Ratio) generally get higherAPRs.Analysis revealed that larger
LAsgenerally have smallerDIRs. This can be understood because of risk.Our analysis revealed that
IV(Income Verification) status has a positive relationship withLA. This is logical, with proof of income, the lender is willing to lend more money.Exploring relationship between
IVandAPRwe discovered that with verifiable income (VI) theAPRrange is larger, as well as the IQR, compared to noIV.This is very interesting, because higher risk (no verifiable income) should be reflected in higher
APRs(interest rates). But very highAPRon the risky group (no verifiable income) may raise the risk of default, so that maybe the reason for a smaller range in this group. Lender on one hand doesn't give lowAPRto this group on the other hand doesn't lend with highAPRseither. According to the plot, "no income verification" status, raises the minimum APR and lowers the maximumAPRfor the borrower.Our exploratory analysis showed that
IRandLAhave a positive relatioship: borrowers from higher income range (IR) quaify for higher loan amounts (LAs).According to our analysis there is no evident reationship between
IRandAPR. This means that borrower's income range (IR) has no effect on the APR (interest rate) he receives.Eploration of
ES(employment status) andIR(income range) relatioship provided understanding of the ranking of the ES according to the income size: Employed, Full-time, Other, Not Available, Self-employed, Part-time, Retired, Not-emploeyedWe also observed that larger
LAs(loan amounts) generally have smallerAPRs.
It is not clear what statuses "Employed", "Full-time" and "Other" really mean to the bank. Why he issues more 10K and 15K
LAsto "Other" than "Full-time", but more 25KLAsto "Employed" than "Other"Even within the "higher employment" group it is not clear why "Full-time" has a lower
APRthan "Employed", considering the fact that "Employed" displayed a higherIRthan "Full-time"Interesting insight, ND (Not displayed - low income range group) group has very large outliers with
LAs(loan amounts - 25K). These are levels for which 99KIR(high income range group) qualifies. What is the true meaning of ND status?Interesting insight, minimum levels of
APRthat are available for "ND" (Not Dispalyed - income range group) and "25K-49.9K"IR, are not availabe for the higherIRs.Insight: How does the lender justify higher
APRsto the group with "verifiable income" (IV) and highIR"income range"?
During Bivariate Exploration we have identified borrower's parameters (variables) that affect loan parameters: Loan Amount (LA) and APR (Annual Percentage Rate)
ES - Employment Status
DIR - Debt to Income Ratio
IR - Income Range
HOW - Home ownership
We've also raised a lot of questions during our bivariate exploration, all of which we will not attempt to answer within the scope of this project.
Our further Multivariate Exploration we will focus on visualizing the relatioship between borrower's parameters - ES, DIR, IR; and loan parameters - LA, APR, silmontaneously, to identify trends and insights.
We would also explore how the three categorical measures of LS (Loan Status) play into the relationship between loan parameters LA and APR.
We will also introduce a new borrower's parameter, variable HOW (IsBorrowerHomeowner) to our exploration. To see if affects loan parameters LA and APR.
ld_u.EmploymentStatus.value_counts()
samples = np.random.choice(ld_u.shape[0], 2000, replace = False)
loans_samp = ld_u.loc[samples,:]
g = sb.FacetGrid(data = loans_samp, hue = 'EmploymentStatus',
hue_order = ['Employed', 'Full-time'],
height = 6, aspect = 2)
g = g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter = 0.04, fit_reg = False);
g.add_legend();
It is evident that "Employed" status receives bigger loans at lower interest compared to "Full-time" status.
ld_u.IncomeRange.value_counts()
samples = np.random.choice(ld_u.shape[0], 1000, replace = False)
loans_samp = ld_u.loc[samples,:]
g = sb.FacetGrid(data = loans_samp,
hue = 'IncomeRange',
height = 4, aspect = 1.5,
palette = 'viridis_r') #palette instead of hue_order
g = g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter = 0.04, fit_reg = False);
g.add_legend();
It is clearly seen that "darker", 100K+ IR qulify for 15K LA onwards. And "lighter", lower IR qualify for smaller loan amounts. We also see that larger loans for high IR come with a lower APR than the lower ones.
samples = np.random.choice(ld_u.shape[0],700, replace = False)
loans_samp = ld_u.loc[samples,:]
plt.figure(figsize = [10, 8])
sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR',
x_jitter = 0.1, fit_reg = False,
scatter_kws = {'s' : loans_samp['DebtToIncomeRatio']*300});
# plt.xlabel('Loan Amount in USD')
# plt.ylabel('Debt to Income Ratio')
# plt.legend(['Completed', 'Chargedoff'])
There is no visible trend or relationship on the plot above
# Checking for the different values in the IsBorrowerHomeowner column
ld.IsBorrowerHomeowner.value_counts()
samples = np.random.choice(ld_u.shape[0], 5000, replace = False)
loans_samp = ld_u.loc[samples,:]
g = sb.FacetGrid(data = loans_samp, col = 'LoanStatus', row = 'IsBorrowerHomeowner',
margin_titles = True)
g.map(plt.scatter, 'LoanOriginalAmount', 'BorrowerAPR');
"Current" LS shows that now homeowners are priveleged with higher LAs in comparison to "non-homeowners". "Completed" LS displays a similar relationship as above, but at lower LAs. "Chargedoff" is only slightly in favor of homeowners with respect to LAs. "Defaulted" LS is showing "non-homeowners" with LAs that are clustered around small values as opposed the homeowners where the defaults are spread accross smaller and larger LAs.
ld_u.IncomeRange.value_counts()
ld_u.LoanStatus.value_counts()
samples = np.random.choice(ld_u.shape[0],5000, replace = False)
loans_samp = ld_u.loc[samples,:]
ttype_markers = [['Current', 'o'],
['Chargedoff', '^']]
plt.figure(figsize = [10, 8])
for ttype, marker in ttype_markers:
plot_data = loans_samp.loc[loans_samp['LoanStatus'] == ttype]
sb.regplot(data = plot_data, x = 'LoanOriginalAmount', y = 'BorrowerAPR',
x_jitter = 0.04, fit_reg = False, marker = marker);
# plt.xlabel('Loan Amount in USD')
# plt.ylabel('Debt to Income Ratio')
# plt.legend(['Completed', 'Chargedoff'])
We can see the how the lender with the "Current" loans have used past experience and adjusted his lending strategy from "small" LAs < 1,5K to larger LAs > K2,5 LAs, and from high APRs > 0.36 to > 0.36 larger sums at smaller APRs.
ttype_markers = [['Current', 'o'],
['Defaulted', '^']]
plt.figure(figsize = [10, 8])
for ttype, marker in ttype_markers:
plot_data = loans_samp.loc[loans_samp['LoanStatus'] == ttype]
sb.regplot(data = plot_data, x = 'LoanOriginalAmount', y = 'BorrowerAPR',
x_jitter = 0.04, fit_reg = False, marker = marker);
# plt.xlabel('Loan Amount in USD')
# plt.ylabel('Debt to Income Ratio')
# plt.legend(['Completed', 'Chargedoff'])
Same shift away from smaller loans and high rates is shown in the graph above.
ttype_markers = [['Current', 'o'],
['Past Due', '^']]
plt.figure(figsize = [10, 8])
for ttype, marker in ttype_markers:
plot_data = loans_samp.loc[loans_samp['LoanStatus'] == ttype]
sb.regplot(data = plot_data, x = 'LoanOriginalAmount', y = 'BorrowerAPR',
x_jitter = 0.04, fit_reg = False, marker = marker);
# plt.xlabel('Loan Amount in USD')
# plt.ylabel('Debt to Income Ratio')
# plt.legend(['Completed', 'Chargedoff'])
The lender's shift towards larger loans: 20K to 35K, for lower rates has beared fruit with little or no delinquencies within the lower LAs, where most bad loans registered in the past. The strategy also paid of at the higher end with no delinquencies with the larges loans.
Multivariate Exploration supported the findings we got in Bivariate Exploration: the "Employed" status receives bigger loans at lower interest rates compared to "Full-time" and "Not available" employment statuses.
We can conclude that Employment Status (ES) affects parameters of the loan (LA and APR)In this exploration we again observed the relationship we found in Bivariate Exploration between IR, LA and APR. Similar to ES, high IRs receive higher LAs and lower APRs.
Here we can conclude that IR influences loan parameters similarly to ES This analysis showed us that currently, the bank favors homeowners with larger loans. Homeownership didn't reveal differences in parameters for "bad" loans. However, "Defaulted" LSs (which are more current in comparison to the "Chargedoff" LSs) showed the non-homeowners with mainly smaller LAs compared to homeowners. These observations perfectly play into the new strategy that the lender has adopted, which we will explain with the next exploration results.
We can conclude that homeownership affects the Loan Amount. However we didn't observe relationship with APRExploration showed how with the "Current" loans the lender has used past experience and adjusted his lending strategy from "small" LAs < 1,5K to larger LAs > 2,5K LAs, and from high > 0.36 to > 0.36 APRs. Tactics: Larger sums at smaller APRs.
Analysis showed the same shift away from smaller loans and high rates as above. This makes sense, since "Defaulted" is a more current LS than "Chalgedoff", reflecting the current risk management that we see in "Current" LSs.
Here the exploration displayed how the lender's shift towards larger loans: 20K to 35K, for lower rates has beared fruit with little or no delinquencies within the lower LAs, where most bad loans registered in the past. The strategy also paid off at the higher end with no delinquencies with the largest LAs.
We can conclude from these results that the lender is very thorough in analysing his lending products, and how the borrowers parameters fit into them.ld_u.shape